Chinese Named Entity Identification Using Class-based Language Model
Authors
Abstract
We consider the problem of Chinese named entity (NE) identification using statistical language models (LMs). Word segmentation and NE identification are integrated into a unified framework that consists of several class-based language models. We also adopt a hierarchical structure for one of the LMs so that the entities nested in organization names can be identified. Evaluation on a large test set shows consistent improvements. Our experiments further demonstrate improvements after integrating linguistic heuristic information, a cache-based model, and NE abbreviation identification.
Introduction
NE identification is a key technique in many applications, such as information extraction, question answering, and machine translation. English NE identification has achieved great success. Chinese NE identification, however, is very different: there are no spaces to mark word boundaries, and there is no standard definition of a word in Chinese. Chinese NE identification and word segmentation therefore interact with each other. This paper presents a unified approach that integrates the two steps using a class-based LM and applies Viterbi search to select the globally optimal solution.
The class-based LM consists of two sub-models, the context model and the entity model. The context model estimates the probability of generating an NE given a certain context, and the entity model estimates the probability of a sequence of Chinese characters given a certain kind of NE. In this study, we are interested in the three most commonly used kinds of Chinese NE: person names (PER), location names (LOC), and organization names (ORG). We have also adopted a variety of approaches to improving the LM. In addition, a hierarchical structure is employed for the organization LM so that PERs and LOCs nested in ORGs can be identified. The evaluation is conducted on a large test set in which NEs have been manually tagged.
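The interaction between the context model, the entity model, and Viterbi search can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lexicon, all probabilities, the bigram (rather than a higher-order) context model, the `FLOOR` back-off constant, and the function names `candidates` and `viterbi` are assumptions made for illustration only.

```python
import math

NE_CLASSES = ("PER", "LOC", "ORG")

# Hypothetical toy parameters; a real system would estimate these from a tagged corpus.
LEXICON = {"在", "工作"}            # ordinary words: each word is its own class
ENTITY_MODEL = {                    # entity model: P(character string | NE class)
    "PER": {"张三": 0.01},
    "LOC": {"北京": 0.02},
    "ORG": {},
}
CONTEXT_BIGRAM = {                  # context model: P(class | previous class)
    ("<s>", "PER"): 0.2,
    ("PER", "在"): 0.3,
    ("在", "LOC"): 0.4,
    ("LOC", "工作"): 0.1,
    ("工作", "</s>"): 0.5,
}
FLOOR = 1e-8                        # crude stand-in for smoothing of unseen events


def context_logp(prev_cls, cls):
    """Log probability of `cls` following `prev_cls` under the context model."""
    return math.log(CONTEXT_BIGRAM.get((prev_cls, cls), FLOOR))


def candidates(text, start, max_len=4):
    """Yield (end, class, log P(chars | class)) for spans beginning at `start`."""
    for end in range(start + 1, min(len(text), start + max_len) + 1):
        span = text[start:end]
        if span in LEXICON:
            yield end, span, 0.0    # a word class generates itself with probability 1
        elif end == start + 1:
            yield end, span, 0.0    # fallback: unknown single character as its own class
        for ne in NE_CLASSES:
            p = ENTITY_MODEL[ne].get(span)
            if p:
                yield end, ne, math.log(p)


def viterbi(text):
    """Jointly segment `text` and tag NEs; returns a list of (span, class) pairs."""
    # best[pos][cls] = (log score, back pointer (prev_pos, prev_cls))
    best = [dict() for _ in range(len(text) + 1)]
    best[0]["<s>"] = (0.0, None)
    for pos in range(len(text)):
        for prev_cls, (score, _) in best[pos].items():
            for end, cls, emit in candidates(text, pos):
                s = score + context_logp(prev_cls, cls) + emit
                if cls not in best[end] or s > best[end][cls][0]:
                    best[end][cls] = (s, (pos, prev_cls))
    # Pick the final state that scores best once the end-of-sentence transition is added.
    last = best[-1]
    cls = max(last, key=lambda c: last[c][0] + context_logp(c, "</s>"))
    # Follow back pointers to recover the best path.
    path, pos = [], len(text)
    while cls != "<s>":
        prev_pos, prev_cls = best[pos][cls][1]
        path.append((text[prev_pos:pos], cls))
        pos, cls = prev_pos, prev_cls
    return path[::-1]


if __name__ == "__main__":
    print(viterbi("张三在北京工作"))
```

Because every candidate span is scored by both sub-models inside one search, segmentation and NE tagging are decided together rather than in sequence, which is the point of the unified framework described above.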
The experimental results show consistent improvements over existing methods. Our experiments further demonstrate improvements after integrating linguistic heuristic information, a cache-based model, and NE abbreviation identification. On the test set, precision for PER, LOC, and ORG is 79.86%, 80.88%, and 76.63%, respectively, and recall is 87.29%, 82.46%, and 56.54%, respectively.
Related Work
Similar resources
A Class-based Language Model Approach to Chinese Named Entity Identification
This paper presents a method of Chinese named entity (NE) identification using a class-based language model (LM). Our NE identification concentrates on three types of NEs, namely, personal names (PERs), location names (LOCs) and organization names (ORGs). Each type of NE is defined as a class. Our language model consists of two sub-models: (1) a set of entity models, each of which estimates the...
Chinese Named Entity Identification Using Class-based Language Model
We consider here the problem of Chinese named entity (NE) identification using statistical language model (LM). In this research, word segmentation and NE identification have been integrated into a unified framework that consists of several class-based language models. We also adopt a hierarchical structure for one of the LMs so that the nested entities in organization names can be identified. T...
A Novel Approach to Conditional Random Field-based Named Entity Recognition using Persian Specific Features
Named entity recognition is an information extraction technique that identifies named entities in a text. Three approaches have conventionally been used to extract named entities from a text: rule-based, machine-learning-based, and hybrids of the two. Machine-learning-based methods perform well on Persian if they are trained with good features. To get good performanc...
Named Entity Recognition in Persian Text using Deep Learning
Named entity recognition is a fundamental task in natural language processing. It is also regarded as a subtask of information extraction. The process aims at finding proper nouns in a text and classifying them into predetermined classes such as names of people, organizations, and places. In this paper, we propose a named entity recognizer which benefi...
Improving Persian Named Entity Recognition Using the Ezafe Marker (Kasre-ye Ezafe)
Named entity recognition is a process in which the names of people, places (cities, countries, seas, etc.), and organizations (public and private companies, international institutions, etc.), as well as dates, currencies, and percentages in a text, are identified. Named entity recognition plays an important role in many NLP tasks such as semantic role labeling, question answering, summarization, machine ...
Publication date: 2002